Pitfalls of single-study external validation illustrated with a model predicting functional outcome after aneurysmal subarachnoid hemorrhage

Background: Prediction models are often externally validated with data from a single study or cohort. However, the interpretation of performance estimates obtained with single-study external validation is not as straightforward as assumed. We aimed to illustrate this by conducting a large number of external validations of a prediction model for functional outcome in subarachnoid hemorrhage (SAH) patients.

Methods: We used data from the Subarachnoid Hemorrhage International Trialists (SAHIT) data repository (n = 11,931, 14 studies) to refit the SAHIT model for predicting a dichotomous functional outcome (favorable versus unfavorable), with the (extended) Glasgow Outcome Scale or modified Rankin Scale score, at a minimum of three months after discharge. We performed leave-one-cluster-out cross-validation to mimic the process of multiple single-study external validations. Each study represented one cluster. In each of these validations, we assessed discrimination with Harrell’s c-statistic and calibration with calibration plots, the intercepts, and the slopes. We used random effects meta-analysis to obtain the (reference) mean performance estimates and between-study heterogeneity (I² statistic). The influence of case-mix variation on discriminative performance was assessed with the model-based c-statistic, and we fitted a “membership model” to obtain a gross estimate of transportability.

Results: Across 14 single-study external validations, model performance was highly variable. The mean c-statistic was 0.74 (95% CI 0.70–0.78, range 0.52–0.84, I² = 0.92), the mean intercept was -0.06 (95% CI -0.37 to 0.24, range -1.40 to 0.75, I² = 0.97), and the mean slope was 0.96 (95% CI 0.78–1.13, range 0.53–1.31, I² = 0.90). The decrease in discriminative performance was attributable to case-mix variation, between-study heterogeneity, or a combination of both. Incidentally, we observed poor generalizability or transportability of the model.

Conclusions: We demonstrate two potential pitfalls in the interpretation of model performance with single-study external validation. With single-study external validation, (1) model performance is highly variable and depends on the choice of validation data, and (2) no insight is provided into the generalizability or transportability of the model that is needed to guide local implementation. As such, a single-study external validation can easily be misinterpreted and lead to a false appreciation of the clinical prediction model. Cross-validation is better equipped to address these pitfalls.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12874-024-02280-9.
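To make the pooling step described in the Methods concrete, the following is a minimal sketch of a random-effects meta-analysis of per-study performance estimates. It assumes the DerSimonian-Laird estimator for the between-study variance, which the abstract does not specify. The function name `random_effects_pool` and the example slopes and standard errors are hypothetical and are not SAHIT results. The heterogeneity statistic is returned as a proportion, in line with how it is reported above.

```python
import math


def random_effects_pool(estimates, std_errors):
    """Pool per-study estimates with a DerSimonian-Laird random-effects model."""
    k = len(estimates)
    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    df = k - 1
    # Method-of-moments estimate of the between-study variance tau^2.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)
    # I^2: share of total variability attributed to between-study heterogeneity.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    # Random-effects weights add tau^2 to each within-study variance.
    w_re = [1.0 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))
    return {
        "pooled": pooled,
        "ci_low": pooled - 1.96 * se_pooled,
        "ci_high": pooled + 1.96 * se_pooled,
        "tau2": tau2,
        "i2": i2,
    }


# Hypothetical calibration slopes and standard errors from four validations.
slopes = [0.53, 0.90, 1.05, 1.31]
ses = [0.08, 0.06, 0.07, 0.10]
result = random_effects_pool(slopes, ses)
print({name: round(value, 3) for name, value in result.items()})
```

In practice, c-statistics are often pooled on the logit scale and back-transformed; that step is omitted here to keep the sketch short.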


Measure Explanation
Model performance

Discrimination
The ability to discriminate between high-risk patients and low-risk patients. Does the model accurately identify those that experience the outcome versus those that do not?

Harrell's c-statistic
Used to assess discriminative performance for a binary outcome.

Calibration
The agreement between the predicted risk and the observed risk.
How accurate are the risk predictions?

Calibration intercept
The intercept is the ratio between expected and observed outcomes. Ideally, the intercept has a value of 0, whereas a negative value means that the predicted risk overestimates the observed risk, and a positive value means that the predicted risk underestimates the observed risk.

Calibration slope
The beta value of the calibration model. Evaluates the spread of the estimated risks and ideally has a value of 1. A value below

Introduction

Background and objectives
3a | Explain the medical context (including whether diagnostic or prognostic) and rationale for developing or validating the multivariable prediction model, including references to existing models. | 3-4
3b | Specify the objectives, including whether the study describes the development or validation of the model or both. | 4

Source of data
4a | Describe the study design or source of data (e.g., randomized trial, cohort, or registry data), separately for the development and validation data sets, if applicable.
4b | Specify the key study dates, including start of accrual; end of accrual; and, if applicable, end of follow-up. | Suppl

Participants
5a | Specify key elements of the study setting (e.g., primary care, secondary care, general population) including number and location of centres. | Suppl
5b | Describe eligibility criteria for participants. | Suppl
5c | Give details of treatments received, if relevant. | NA

Outcome
6a | Clearly define the outcome that is predicted by the prediction model, including how and when assessed.
6b | Report any actions to blind assessment of the outcome to be predicted. | NA

Predictors
7a | Clearly define all predictors used in developing or validating the multivariable prediction model, including how and when they were measured. | 4-5
7b | Report any actions to blind assessment of predictors for the outcome and other predictors. | NA

Sample size
8 | Explain how the study size was arrived at. | 4

Missing data
9 | Describe how missing data were handled (e.g., complete-case analysis, single imputation, multiple imputation) with details of any imputation method. | 5

Statistical analysis methods
10c | For validation, describe how the predictions were calculated.

Participants
13a | Describe the flow of participants through the study, including the number of participants with and without the outcome and, if applicable, a summary of the follow-up time. A diagram may be helpful. | 4-5
13b | Describe the characteristics of the participants (basic demographics, clinical features, available predictors), including the number of participants with missing data for predictors and outcome. | 7-8,
Abbreviations: DCI = delayed cerebral ischemia; mRS = modified Rankin Scale; GOS(E) = Glasgow Outcome Scale (Extended); NIHSS = National Institutes of Health Stroke Scale; NR = not reported; TAR = target aneurysm recurrence.
* The randomized controlled trials are ALISAH = Albumin in Subarachnoid Hemorrhage Trial; BRANT = the British Aneurysm Nimodipine trial; CONSCIOUS-I = the Clazosentan to Overcome Neurological Ischemia and Infarction occurring after SAH trial; EPO/Statin = the Acute Systemic Erythropoietin Therapy to Reduce Delayed Ischemic Deficits following SAH, and the Effects of Acute Treatment with Statins on Cerebral Autoregulation in patients after SAH trials; HHU = Heinrich Heine University Concomitant Intraventricular Fibrinolysis and Low-Frequency Rotation After Severe Subarachnoid Haemorrhage trial; IHAST = Intraoperative Hypothermia for Aneurysm Surgery Trial; I-MASH = Intravenous Magnesium Sulphate for Aneurysmal Subarachnoid Haemorrhage trial; ISAT = International Subarachnoid Aneurysm Trial; MAPS = Matrix and platinum science trials; the Tirilazad trials; MASH-I = the Magnesium Sulphate in Aneurysmal Subarachnoid Haemorrhage trials.
** CARAT = cerebral aneurysm re-rupture after treatment; the SAH registry of the University of Chicago; D-SAT = the dataset of subarachnoid treatment of the University of Washington;

The c-statistic is the proportion of all possible pairs of observations discordant with the outcome (i.e., one with the outcome and one without), in which the subject with the outcome had a higher predicted probability than the one without the outcome. A c-statistic of 0.5 means an uninformative model and a c-statistic of 1 means perfect discrimination.

Optimism-corrected c-statistic
Obtained through bootstrap validation. The difference between the apparent and the optimism-corrected c-statistic is called optimism. Optimism is the overestimation of the predictive performance due to modelling random noise in the development data.

Model-based c-statistic
Case-mix heterogeneity-controlled measure of discriminative performance. The model-based c-statistic assumes the coefficients are correct. The difference between Harrell's c-statistic and the model-based c-statistic can be interpreted as the difference due to case-mix variation.
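The two discrimination measures defined above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the code used in the study: ties in predicted probability are counted as half-concordant (a common convention that the text does not specify), and the model-based variant is written as the expected concordance when the predicted probabilities are taken as the true risks, which is one way to express "assumes the coefficients are correct". The function names and the toy data are hypothetical.

```python
def c_statistic(y, p):
    """Share of (event, non-event) pairs where the event has the higher prediction."""
    concordant = 0.0
    n_pairs = 0
    for yi, pi in zip(y, p):
        for yj, pj in zip(y, p):
            if yi == 1 and yj == 0:  # only pairs with different outcomes count
                n_pairs += 1
                if pi > pj:
                    concordant += 1.0
                elif pi == pj:
                    concordant += 0.5  # assumption: ties count as half
    return concordant / n_pairs


def model_based_c(p):
    """Expected concordance if the predicted probabilities were the true risks."""
    numerator = 0.0
    denominator = 0.0
    n = len(p)
    for i in range(n):
        for j in range(i + 1, n):
            only_i = p[i] * (1.0 - p[j])  # i has the outcome, j does not
            only_j = p[j] * (1.0 - p[i])  # j has the outcome, i does not
            denominator += only_i + only_j
            if p[i] > p[j]:
                numerator += only_i
            elif p[i] < p[j]:
                numerator += only_j
            else:
                numerator += 0.5 * (only_i + only_j)
    return numerator / denominator


# Toy data: observed outcomes (1 = outcome occurred) and predicted probabilities.
outcomes = [1, 1, 0, 0]
predicted = [0.9, 0.4, 0.6, 0.2]
print(c_statistic(outcomes, predicted))
print(model_based_c(predicted))
```

Because `model_based_c` uses only the predicted probabilities, it changes when the spread of predictions (the case mix) changes, even if the model coefficients stay the same.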

Table 4.
Observational Neurocognitive Study from University of Durham, Durham, United Kingdom; the Hospital registry from Kurashiki Central Hospital, Japan; University of Leeds Neurocognitive observation, Leeds, United Kingdom; SHOP = the subarachnoid haemorrhage outcomes project of Columbia University; St Michael's Hospital, Toronto, Canada; the Swiss study on SAH, a nationwide registry of SAH from Switzerland (SWISS); University Medical Centre Utrecht SAH Registry, Utrecht, The Netherlands.
Excluded studies and the reason for exclusion.
Explanations of performance measures discussed in this study.
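As a companion to the calibration measures explained in this study, the sketch below estimates a calibration intercept and a calibration slope from observed outcomes and predicted risks. It assumes the usual logistic recalibration setup: the intercept is estimated with the linear predictor (the logit of the predicted risk) entered as an offset, and the slope is the coefficient of the linear predictor in a logistic regression. The Newton-Raphson fitting, the function names, and the toy data are illustrative choices and not taken from the study.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def _logit(p):
    return math.log(p / (1.0 - p))


def calibration_intercept(y, p, max_iter=50):
    """Intercept a in logit(P(y=1)) = a + logit(p), with the slope fixed at 1."""
    lp = [_logit(pi) for pi in p]
    a = 0.0
    for _ in range(max_iter):
        mu = [_sigmoid(a + li) for li in lp]
        gradient = sum(yi - mi for yi, mi in zip(y, mu))
        hessian = sum(mi * (1.0 - mi) for mi in mu)
        step = gradient / hessian
        a += step
        if abs(step) < 1e-10:
            break
    return a


def calibration_slope(y, p, max_iter=50):
    """Slope b in logit(P(y=1)) = a + b * logit(p), with a and b both estimated."""
    lp = [_logit(pi) for pi in p]
    a, b = 0.0, 1.0
    for _ in range(max_iter):
        mu = [_sigmoid(a + b * li) for li in lp]
        w = [mi * (1.0 - mi) for mi in mu]
        r = [yi - mi for yi, mi in zip(y, mu)]
        g0 = sum(r)
        g1 = sum(ri * li for ri, li in zip(r, lp))
        h00 = sum(w)
        h01 = sum(wi * li for wi, li in zip(w, lp))
        h11 = sum(wi * li * li for wi, li in zip(w, lp))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = gradient.
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if max(abs(da), abs(db)) < 1e-10:
            break
    return b


# Toy data: three risk groups of ten patients whose observed event rates
# (2/10, 5/10, 8/10) equal the predicted risks (0.2, 0.5, 0.8).
predicted = [0.2] * 10 + [0.5] * 10 + [0.8] * 10
outcomes = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2
print(calibration_intercept(outcomes, predicted))
print(calibration_slope(outcomes, predicted))
```

In the toy data the observed event rates match the predicted risks in every group, so the intercept comes out at 0 and the slope at 1 up to floating-point error. Lowering the observed event rates while keeping the predictions fixed would push the intercept below 0, the overestimation case described among the performance measures.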
ALISAH = Albumin in Subarachnoid Hemorrhage Trial; BRANT = the British Aneurysm Nimodipine trial; CARAT = cerebral aneurysm re-rupture after treatment; Observational Neurocognitive Study from University of Durham, Durham, United Kingdom; I-MASH = Intravenous Magnesium Sulphate for Aneurysmal Subarachnoid Haemorrhage trial; the Hospital registry from Kurashiki Central Hospital, Japan; St Michael's Hospital, Toronto, Canada; the Swiss study on SAH, a nationwide registry of SAH from Switzerland (SWISS).

Table 1
13c | For validation, show a comparison with the development data of the distribution of important variables (demographics, predictors and outcome).
Give an overall interpretation of the results, considering objectives, limitations, results from similar studies, and other relevant evidence. | 10-14

Implications
20 | Discuss the potential clinical use of the model and implications for future research. | NA